This report shows the QC steps for gene expression microarry data from GEO study, including:

Mannually change the variables for GEO ID (geo_id), data directory (datadir) and result directory (resdir).

# GEO id
geo_id="GSE4917"
# directory stores GEO data
datadir="data"
# directory stores generated files
resdir="results"

Note that three variables, platform (platform), geo_GPL (GPL id for analysis if the samples in the study were scanned on multiple platforms) and normdata (whether the expression matrix is normalized), need to be manually re-defined in the following steps after look into the datasets. A shortname_func function is suggested to be updated.

platform="Affymetrix"
geo_GPL=""
normdata=FALSE
# The shortname_func function shortens the sample name shown in the plots. To start, define shortname_func <- function(x){x}
shortname_func <- function(x){gsub("^(.*)\\.(cel|CEL).gz","\\1",x)}

Install the prerequisite R packages if they do not exist

Load the necessary libraries. Load affy and dplyr packages later since they will mask other functions.

GEO Data Download and Phenotype Preparation

GEO Dataset Download

Download the GEO series matrix files if available.

Show expression dataset features

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22215 features, 24 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM109204 GSM109205 ... GSM109227 (24 total)
##   varLabels: title geo_accession ... Genotype:ch1 (37 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1007_s_at 1053_at ... 91952_at (22215 total)
##   fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16
##     total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL96

Manual inspection: Re-define the platform variable “platform” (i.e. “Affymetrix”, “Agilent”, “Illumina”).

Modify raw phenotype information

Obtain raw phenotype information from the GEO dataset and generated a summary of all the phenotypic variables for overview.

For continuous variables, show the summary table. For categorical variables, only show the first five levels of variables.

Generate a variable, suppldata (whether supplementary data are available), based on whether the column supplementary_file is none.

title
title counts
Control(ethanol)-treated sample at 24hr, biological rep1 1
Control(ethanol)-treated sample at 24hr, biological rep2 1
Control(ethanol)-treated sample at 24hr, biological rep3 1
Control(ethanol)-treated sample at 2hr, biological rep1 1
Control(ethanol)-treated sample at 2hr, biological rep2 1
geo_accession
geo_accession counts
GSM109204 1
GSM109205 1
GSM109206 1
GSM109207 1
GSM109208 1
status
status counts
Public on Jun 02 2006 24
submission_date
submission_date counts
May 17 2006 24
last_update_date
last_update_date counts
Apr 24 2013 24
type
type counts
RNA 24
channel_count
channel_count counts
1 24
source_name_ch1
source_name_ch1 counts
MCF10A-Myc cells treated with dexamethasone for 24hr 3
MCF10A-Myc cells treated with dexamethasone for 2hr 3
MCF10A-Myc cells treated with dexamethasone for 30 min 2
MCF10A-Myc cells treated with dexamethasone for 30min 1
MCF10A-Myc cells treated with dexamethasone for 4hr 3
organism_ch1
organism_ch1 counts
Homo sapiens 24
characteristics_ch1
characteristics_ch1 counts
Genotype: unknown 24
characteristics_ch1.1
characteristics_ch1.1 counts
Age: 35 24
treatment_protocol_ch1
treatment_protocol_ch1 counts
MCF10A-Myc cells were subjected to 72 hrs of growth factor withdrawal followed by treatment with dexamethasone (Dex) for 24hr. 1
MCF10A-Myc cells were subjected to 72 hrs of growth factor withdrawal followed by treatment with dexamethasone for 2 hr. 2
MCF10A-Myc cells were subjected to 72 hrs of growth factor withdrawal followed by treatment with dexamethasone for 24 hr. 2
MCF10A-Myc cells were subjected to 72 hrs of growth factor withdrawal followed by treatment with dexamethasone for 2hr. 1
MCF10A-Myc cells were subjected to 72 hrs of growth factor withdrawal followed by treatment with dexamethasone for 30 min. 3
molecule_ch1
molecule_ch1 counts
total RNA 24
extract_protocol_ch1
extract_protocol_ch1 counts
Total RNA from each sample was extracted using Qiagen’s RNeasy Kit. 24
label_ch1
label_ch1 counts
biotin 24
label_protocol_ch1
label_protocol_ch1 counts
Biotinylated cRNA were prepared according to the standard Affymetrix protocol from 6 microg total RNA (Expression Analysis Technical Manual, 2001, Affymetrix). 24
taxid_ch1
taxid_ch1 counts
9606 24
hyb_protocol
hyb_protocol counts
Following fragmentation, 10 microg of cRNA were hybridized for 16 hr at 45C on Affymetrix human genome HG-U133A genechip. GeneChips were washed and stained in the Affymetrix Fluidics Station 450. 24
scan_protocol
scan_protocol counts
GeneChips were scanned using the Affymetrix GeneChip® Scanner 3000 7G. 19
GeneChips were scanned using the Affymetrix GeneChip® Scanner 3000 7G.. 5
description
description counts
Gene expression data from embryos in slow phase of cellularisation. 1
Gene expression data from MCF10A-Myc cells treated with dex for 2 hr. 1
Gene expression data from MCF10A-Myc cells treated with Dex for 2 hr. 1
Gene expression data from MCF10A-Myc cells treated with dex for 24 hr. 1
Gene expression data from MCF10A-Myc cells treated with Dex for 24 hr. 1
data_processing
data_processing counts
The data were analyzed and normalized Robust Multi-array Average (RMA) algorithm. 24
platform_id
platform_id counts
GPL96 24
contact_name
contact_name counts
Suzanne,Daniela,Conzen 24
contact_email
contact_email counts
sconzen@medicine.bsd.uchicago.edu 24
contact_phone
contact_phone counts
(773)834-2604 24
contact_laboratory
contact_laboratory counts
Conzen Lab 24
contact_department
contact_department counts
Medicine 24
contact_institute
contact_institute counts
The University of Chicago 24
contact_address
contact_address counts
5841 S. Maryland Ave, SBRI J301, MC 2115 24
contact_city
contact_city counts
Chicago 24
contact_state
contact_state counts
IL 24
contact_zip/postal_code
contact_zip/postal_code counts
60637 24
contact_country
contact_country counts
USA 24
supplementary_file
supplementary_file counts
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM109nnn/GSM109204/suppl/GSM109204.cel.gz 1
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM109nnn/GSM109205/suppl/GSM109205.cel.gz 1
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM109nnn/GSM109206/suppl/GSM109206.cel.gz 1
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM109nnn/GSM109207/suppl/GSM109207.cel.gz 1
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM109nnn/GSM109208/suppl/GSM109208.cel.gz 1
data_row_count
data_row_count counts
22215 24
Age:ch1
Age:ch1 counts
35 24
Genotype:ch1
Genotype:ch1 counts
unknown 24

Manual inspection: Automatically define the variable “suppldata” (i.e. TRUE or FALSE) showing whether supplementary data is available. Note that samples with supplementary file column equals “None” are excluded from analysis.

Defined suppldata variable

## suppldata = TRUE

These raw phenotypic variables are not informative (e.g. description, characteristics_ch1 and source_name_ch1) and not created in a consice way. Select useful phenotype variables and manually modify them using a standard format (e.g. GEO_ID, Donor, Disease, Treatment, Age, Gender). This step requires mannual inspection.

Raw intensity data download

Download supplementary raw data files if available. Analysis using raw intensity data is only for data from Affymetrix platform. For other platforms, the expression matrix is derive from ExpressionSet object (gse object from GEOquery), but the batch (scan date) information is obtained from the supplementary files and used for differential expression analysis.

For data from Affymetrix platform, the raw.data object is generated from importing supplementary raw data files (usually .cel files) using R oligo package. Scan date information is derived from the raw.data for batch effect adjustment.

For data from Agilent platform, the raw.data object is derived from GEO expression matrix. Scan date information is derived from the raw data files (usually in the 3rd line of a .txt file) for batch effect adjustment.

If supplementary data is not available, the raw.data object is derived from GEO expression matrix. Scan date information is unknown.

Assign phenotype data to raw data object.

Show the summary of phenotype variables and the sample size for different groups

Show the first 5 rows of the modified phenotype file
GEO_ID Donor Sample Tissue treatment_time treatment_drug Treatment Disease Age ScanDate_Group
GSM109204 rep1 GSM109204_rep1 MCF10A-Myc 30min baseline baseline_30min healthy 35 05/23/02
GSM109205 rep1 GSM109205_rep1 MCF10A-Myc 30min dex dex_30min healthy 35 05/23/02
GSM109206 rep1 GSM109206_rep1 MCF10A-Myc 2hr baseline baseline_2hr healthy 35 05/23/02
GSM109207 rep1 GSM109207_rep1 MCF10A-Myc 2hr dex dex_2hr healthy 35 05/23/02
GSM109208 rep1 GSM109208_rep1 MCF10A-Myc 4hr baseline baseline_4hr healthy 35 05/23/02
Sample size in different tissue and disease/treatment groups
Tissue Disease Treatment Count
MCF10A-Myc healthy baseline_24hr 3
MCF10A-Myc healthy baseline_2hr 3
MCF10A-Myc healthy baseline_30min 3
MCF10A-Myc healthy baseline_4hr 3
MCF10A-Myc healthy dex_24hr 3
MCF10A-Myc healthy dex_2hr 3
MCF10A-Myc healthy dex_30min 3
MCF10A-Myc healthy dex_4hr 3
Sample size in different batch
ScanDate_Group Count
04/13/05 8
05/23/02 7
05/31/02 8
06/03/02 1

Assign colors to scan date or disease/treatment if scan date is not available.

If gene expression matrix data is used, check if they are normalized/log-transformed. After manual inspection, assign a logistic variable “normdata” (whether needs log2 transformation/normalization or not for QC). If normdata is FALSE, we generate boxplots for log2-transformed and Quantile-normalization of log2-transformed data. Note that if the data are normalized, it is not likely to detect the outliers based on the intensity metrices.

If negative/zero intensity values are present, convert them to NAs.

Manual inspection: Re-define the variable “normdata” (i.e. TRUE or FALSE) showing whether the expression data is normalized or not.

Quality Control for Microarray Data

The major QC steps and scoring methods for outliers were adapted from arrayQualityMetrics. The threshold to determine an outlier used in arrayQualityMetrics is the boxplot’s upper whisker, i.e. values beyond 1.5 times the interquartile range, which is also applied to our pipeline. The following QC metrics are included in a routine analysis. The QC metrics used for outlier detection are marked with an asterisk.

All the above steps can be processed in data from Affymetrix gene expression array. For data from other platforms, metrics for “raw”" proble intensities, MA plots, heatmap for array distance and PCAs can be processed.

Use the prepared phenotype file.

Raw Probe Intensity Boxplots and Density Histograms

The log2-transformed/normalized intensity distributions of all samples (arrays) are expected to have the similar scale (i.e. the similar positions and widths of the boxes). Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ka) between log-intensity distribution for one array and the pooled array data, where an array with a Ka beyond the upper whisker is designated as an outlier.

  1. Outlier detection for log2 raw probe intensity/normalized intensity

Compute the Kolmogorov-Smirnov statistic Ka between each array’s (i.e. sample) values (i.e. log2 transformed raw probe intensity values) and the pooled, overall distribution of the values.

## 0 outlier(s) are detected in the raw intensity metrics.
  1. Boxplots for log2 raw probe intensity

  1. Density curves for log2 raw probe intensity

The intensity curves of all samples (arrays) are expected to have the similar shapes and ranges. Samples with deviated curves are likely to have problematic experiments. For example, high levels of background will shift an array’s distribution to the right. Lack of signal diminishes its right right tail. A bulge at the upper end of the intensity range often indicates signal saturation.

RNA Digestion

Overall RNA quality can be assessed by RNA degradation plots. In the gene expression array, each probe is represented by a probe set. Each probe set is 11-20 probes (pairs of oligos). This plot shows the average intensity of each probe across all probe sets, ordered from the 5´ to the 3´ end. It is expected that probe intensities are lower at the 5´ end of a probe set when compared to the 3´end as RNA degradation starts from the 5´end of a molecule. RNA which is too degraded will shows a very high slope from 5´ to 3´. Thus, the standardized slope of the RNA degradation plot serves as quantitative indicator of the RNA degradation.

This step requires R package affy that outputs the probes in each probe set matrix ordered from 5’ to 3’, while this function is not implemented in the oligo package.

Distribution of Perfect Match (PM) and Mismatch (MM)

There are two paired probe types: perfect match (PM) and of mismatched (MM) probes. A PM probe matches a strand of cDNA, while the corresponding MM probe differs from the PM by a change in the central nucleotide. A probe set is called present if the intensity value of PM is significantly larger than MM. However, the Affymetrix approach is under attack because between 15%-30% of the MM are greater than the PM. For some newer arrays, MM probes are not used. If the number of PMs is not equal to that of MMs, this might be a PM-only array.

If both PM and MM are present, the density curves of log2 PM ana MM intensities are generated, where MM probes are expected to have smaller log2-intensity at the peak than PM probes due to their nonspecific hybridization.

MA Plots

MA plots allow pairwise comparison of the log-intensity of each array to a reference array and identification of intensity-dependent biases.

The y-axis of the MA-plot shows the log-ratio intensity of one array to the reference median array, which is called M (minus). M = log2(I1)-log2(I2) (I1: the intensity of the array studied; I2: the median intensity across arrays)

The x-axis indicates the average log-intensity of both arrays, which is called A (add). A = 1/2*(log2(I1)+log2(I2))

It is expected that the probe levels do not differ systematically within a group of replicates, so that the MA-plot is centered on the y-axis (y=0 or M=0) from low to high intensities.

  1. Outlier detection for MA plots

Outlier detection is applied by computing a Hoeffding’s statistic (Da) on the joint distribution of A and M for each array, where an array with a Da >0.15 is designated as an outlier.

## 2 outliers are detected in the MA metrics.
## They are:  GSM109212.cel.gz GSM109217.cel.gz
  1. MA plots

MA plots of the samples with the 4 highest and lowest Hoeffding’s statistics.

Spatial Distribution

Spatial plots show an artificial colored image of an array’s spatial distribution of intensities that indicate spatial variation in an array. Log-intensities of probes are plotted by their corresponding spatial x and y-coordinate in the array and are expected to be uniformly distributed if the array data has good quality. The rank scale is applied for plotting as it has the potential to amplify patterns that are small in amplitude but systematic within an array.

The affy package is required to obtain the AffyBatch object that contains information of spatial x- and y-coordinate, while this function is not implemented in the oligo package.

  1. Outlier detection for spatial plots

Outlier detection is applied by computing a sum of the absolute values of low frequency Fourier coefficients (Fa) across all probe sets for each array, where an array with a Fa beyond the upper whisker is designated as an outlier.

## 1 outlier(s) are detected in the spatial metrics.
## They are:  GSM109215.cel.gz
  1. Spatial distribution plots

Spatial distribution plots of samples with the 4 highest and lowest values of Fa. The value of Fa for each sample is shown in the panel headings. Outliers marked with * have Fa values of large scale spatial structures.

Relative Log Expression (RLE) Distribution

The normalized unscaled standard error (NUSE) and relative log expression (RLE) boxplots indicate probe set homogeneity in one array, where the metrics are derived from a fitted probe level model by the fitProbeLevelModel function (oligo). The RLE plots represent the distribution of the ratio between the log-intensity of a probe set and the median log-intensity of the corresponding probe set across all arrays, expected to be centered near 0, as a log scale is applied. Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ra) between RLE distribution for one array and the pooled array data, where an array with a Ra beyond the upper whisker is designated as an outlier

  1. Outlier detection for RLE

Compute the Kolmogorov-Smirnov statistic Ra between each array’s (i.e. sample) values (i.e. relative log expression values) and the pooled, overall distribution of the values. Detect outliers that are deviated from the threshold.

## 2 outlier(s) are detected in RLE metrics.
## They are:  GSM109210.cel.gz GSM109215.cel.gz
  1. Boxplot for RLE

Use boxplot_func function to plot RLE. Outliers marked with * have values centered away from 0 and/or are more spread out are potentially problematic.

Normalized Unscaled Standard Error (NUSE) Outlier Detection and Plots

The NUSE plots show the distribution of normalized standard error estimates, expected to be centered near 1. Outlier detection is applied by computing an upper hinge (Na) across all probe sets for each array, where an array with a Na beyond the upper whisker is designated as an outlier.

  1. Outlier detection for NUSE

Compute 75% quantile Na of each array’s NUSE values Detect outliers that have larger Na deviated from the threshold.

2 outlier(s) are detected in NUSE metrics. They are: GSM109210.cel.gz GSM109215.cel.gz

  1. Boxplot for NUSE

Use boxplot_func function to plot RLE. Outliers marked with * have values centered away from 0 and/or are more spread out are potentially problematic.

Distance between Samples and Outlier Detection

Distance between arrays is evaluated using mean absolute difference of log-intensity/normalized intensity between each pair of arrays, where the hierarchical tree between arrays is created based on the distance, which is visualized by a heatmap and dendrogram.

The distance d(ab) between two arrays a and b is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data from all probes without filtering). In the formula (the dist2 function from genefilter package), d(ab) = mean | M(ai) - M(bi) |, where M(ai) is the value of the i-th probe on the a-th array.

  1. Outlier detection for sample distance

Outlier detection is applied by computing the sum of the distances of one array to all other arrays (Sa) (Sa=Sum(b)d(ab)), where an array with a Sa beyond the upper whisker is designated as an outlier.

## 0 outlier(s) are detected in sample distance metrics.
  1. Plot distance between samples

Principal Component Analysis (PCA)

PCA demonstrates information of the expression dataset in a reduced number of dimensions. Clustering and PCA plots enable to assess to what extent arrays resemble each other, and whether this corresponds to the known resemblances of the samples.

  1. Compute PCs and variance explained by the first 10 PCs
Variance explained
PC Proportion of Variance (%) Cumulative Proportion of Variance (%)
PC1 85.75 85.75
PC2 5.7 91.45
PC3 2.186 93.64
PC4 1.211 94.85
PC5 0.979 95.83
PC6 0.64 96.47
PC7 0.4505 96.92
PC8 0.4126 97.33
PC9 0.3629 97.7
PC10 0.3118 98.01
  1. PCA plots

PCA plots are generated using the first two principle components colored by known factors (e.g. treatment/disease conditions, tissue, donors and scan dates), visualizing similarities between arrays and these similarities’ correlation to batch effects.

QC Summary

The summary of outliers and detection methods

Outlier Summary
  Frequency Method
GSM109212.cel.gz 1 MA
GSM109217.cel.gz 1 MA
GSM109210.cel.gz 2 NUSE, RLE
GSM109215.cel.gz 3 NUSE, RLE, spatial

Create a new column “QC_Pass” in the phenotype file. By default, samples detected as an outlier more than twice are assigned to 0 otherwise to 1.

## 1 outlier(s) are defined.
## They are:  GSM109215
Summary of samples without QC
Tissue Treatment Disease Counts
MCF10A-Myc baseline_24hr healthy 3
MCF10A-Myc baseline_2hr healthy 3
MCF10A-Myc baseline_30min healthy 3
MCF10A-Myc baseline_4hr healthy 3
MCF10A-Myc dex_24hr healthy 3
MCF10A-Myc dex_2hr healthy 3
MCF10A-Myc dex_30min healthy 3
MCF10A-Myc dex_4hr healthy 3
Summary of samples with QC
Tissue Treatment Disease Counts
MCF10A-Myc baseline_24hr healthy 3
MCF10A-Myc baseline_2hr healthy 3
MCF10A-Myc baseline_30min healthy 3
MCF10A-Myc baseline_4hr healthy 3
MCF10A-Myc dex_24hr healthy 3
MCF10A-Myc dex_2hr healthy 2
MCF10A-Myc dex_30min healthy 3
MCF10A-Myc dex_4hr healthy 3

Session information

R version 3.4.4 (2018-03-15)

**Platform:** x86_64-w64-mingw32/x64 (64-bit)

locale: LC_COLLATE=English_United States.1252, LC_CTYPE=English_United States.1252, LC_MONETARY=English_United States.1252, LC_NUMERIC=C and LC_TIME=English_United States.1252

attached base packages: stats4, parallel, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: hgu133acdf(v.2.18.0), pd.hg.u133a(v.3.12.0), DBI(v.0.8), RSQLite(v.2.0), pander(v.0.6.1), preprocessCore(v.1.40.0), devtools(v.1.13.5), Hmisc(v.4.1-1), Formula(v.1.2-2), survival(v.2.41-3), lattice(v.0.20-35), gplots(v.3.0.1), ggplot2(v.2.2.1), viridis(v.0.5.1), viridisLite(v.0.3.0), oligo(v.1.42.0), Biostrings(v.2.46.0), XVector(v.0.18.0), IRanges(v.2.12.0), S4Vectors(v.0.16.0), oligoClasses(v.1.40.0), GEOquery(v.2.46.15), Biobase(v.2.38.0) and BiocGenerics(v.0.24.0)

loaded via a namespace (and not attached): bitops(v.1.0-6), matrixStats(v.0.53.1), bit64(v.0.9-7), RColorBrewer(v.1.1-2), rprojroot(v.1.3-2), GenomeInfoDb(v.1.14.0), tools(v.3.4.4), backports(v.1.1.2), R6(v.2.2.2), affyio(v.1.48.0), rpart(v.4.1-13), KernSmooth(v.2.23-15), lazyeval(v.0.2.1), colorspace(v.1.3-2), nnet(v.7.3-12), withr(v.2.1.2), tidyselect(v.0.2.4), gridExtra(v.2.3), bit(v.1.1-12), compiler(v.3.4.4), htmlTable(v.1.11.2), xml2(v.1.2.0), DelayedArray(v.0.4.1), labeling(v.0.3), checkmate(v.1.8.5), caTools(v.1.17.1), scales(v.0.5.0), readr(v.1.1.1), stringr(v.1.3.0), digest(v.0.6.15), foreign(v.0.8-69), rmarkdown(v.1.9), base64enc(v.0.1-3), pkgconfig(v.2.0.1), htmltools(v.0.3.6), limma(v.3.34.9), htmlwidgets(v.1.0), rlang(v.0.2.0), rstudioapi(v.0.7), BiocInstaller(v.1.28.0), bindr(v.0.1.1), gtools(v.3.5.0), acepack(v.1.4.1), dplyr(v.0.7.5), RCurl(v.1.95-4.10), magrittr(v.1.5), GenomeInfoDbData(v.1.0.0), Matrix(v.1.2-12), Rcpp(v.0.12.16), munsell(v.0.4.3), stringi(v.1.1.7), yaml(v.2.1.18), SummarizedExperiment(v.1.8.1), zlibbioc(v.1.24.0), plyr(v.1.8.4), grid(v.3.4.4), affxparser(v.1.50.0), blob(v.1.1.1), gdata(v.2.18.0), splines(v.3.4.4), hms(v.0.4.2), knitr(v.1.20), pillar(v.1.2.1), GenomicRanges(v.1.30.3), codetools(v.0.2-15), glue(v.1.2.0), evaluate(v.0.10.1), latticeExtra(v.0.6-28), data.table(v.1.11.4), foreach(v.1.4.4), gtable(v.0.2.0), purrr(v.0.2.4), tidyr(v.0.8.1), assertthat(v.0.2.0), ff(v.2.2-13), tibble(v.1.4.2), iterators(v.1.0.9), AnnotationDbi(v.1.40.0), memoise(v.1.1.0), cluster(v.2.0.6) and bindrcpp(v.0.2.2)